CS498PS - Lab 4: 3D Audio

In this lab we will learn how to create 3D sounds for headphone playback. We will make use of simple filters and HRTFs to create static and moving sources. Use the three sounds fly.wav, helicopter.wav, and crumble.wav in the lab archive as sources for the 3D recording that you will create.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io.wavfile import read
from scipy import signal
from copy import deepcopy
import math

def sound( x, rate=8000, label=''):
    from IPython.display import display, Audio, HTML
    if label == '':
        display( Audio( x, rate=rate))
    else:
        display( HTML( 
        '<style> table, th, td {border: 0px; }</style> <table><tr><td>' + label + 
        '</td><td>' + Audio( x, rate=rate)._repr_html_()[3:] + '</td></tr></table>'
        ))
        
def convolve_stereo(x, y, mode):
    return [np.convolve(x[0], y[0], mode=mode), np.convolve(x[1], y[1], mode=mode)]

sr_h, data_h = read("helicopter.wav")
sr_f, data_f = read("fly.wav")
sr_c, data_c = read("crumble.wav")

stereo_data_h = [data_h, data_h]
stereo_data_f = [data_f, data_f]
stereo_data_c = [data_c, data_c]

sound( stereo_data_h, rate=sr_h, label='helicopter.wav')
sound( stereo_data_f, rate=sr_f, label='fly.wav')
sound( stereo_data_c, rate=sr_c, label='crumble.wav')

Part 1: Static sources using ITD/ILD cues

Assume the following source locations:

  • Straight ahead

  • 45 degrees to the left

  • 80 degrees to the right

  • 160 degrees to the left

For each location, find the source’s delay between the two ears (assume a source distance of 2 meters), and design two filters that simulate the ITD and ILD cues (feel free to round the ITD delays to an integer number of samples). Assume that when a sound comes from the side of the head, the attenuation at the contralateral ear is by a factor of 0.7. For sounds coming from the medial plane (the plane between the ears) there is no attenuation due to the head. For positions moving from the medial plane towards the sides you can interpolate between no attenuation and a factor of 0.7. Design and plot the filters that correspond to the locations shown above and use them to make 3D versions of the three given sounds.

Listen to the result through headphones and verify that they sound somewhat localized (it won’t sound perfect, but it should be believable).
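The per-location filters in the solution cell below all follow one recipe, so here is a small parameterized sketch. The helper names (`itd_samples`, `ild_gain`, `make_itd_ild_filter`) are my own; the 0.2 m ear spacing, 340 m/s speed of sound, and 0.7 side attenuation come from the assignment, and interpolating the attenuation in |sin θ| is just one defensible reading of 'interpolate'.

```python
import numpy as np

EAR_SPACING = 0.2     # meters between the ears (assumed)
SPEED_OF_SOUND = 340  # m/s

def itd_samples(angle_deg, sr):
    """Interaural delay, rounded down to whole samples, for a source
    at angle_deg (0 = straight ahead, positive = to the left)."""
    delay_sec = abs(np.sin(np.radians(angle_deg))) * EAR_SPACING / SPEED_OF_SOUND
    return int(delay_sec * sr)

def ild_gain(angle_deg):
    """Contralateral gain: 1.0 on the medial plane, 0.7 fully to the
    side, interpolated in |sin(angle)| in between."""
    return 1.0 - 0.3 * abs(np.sin(np.radians(angle_deg)))

def make_itd_ild_filter(angle_deg, sr, length=2048):
    """Stereo impulse pair: the far (contralateral) ear gets the
    delayed, attenuated impulse."""
    h = np.zeros((2, length))
    d, g = itd_samples(angle_deg, sr), ild_gain(angle_deg)
    if angle_deg >= 0:         # source on the left: right ear is far
        h[0][0], h[1][d] = 1.0, g
    else:                      # source on the right: left ear is far
        h[1][0], h[0][d] = 1.0, g
    return h
```

For example, `make_itd_ild_filter(45, sr_h)` reproduces the left-45 combined filter below, up to the choice of attenuation interpolation.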

In [30]:
straight_ahead_filter = np.zeros((2, 2048), dtype=float)
straight_ahead_filter[0][0] = 1.0
straight_ahead_filter[1][0] = 1.0

left_45_filter_ITD = np.zeros((2, 2048), dtype=float)
left_45_filter_ITD[0][0] = 1.0
left_45_filter_ITD[1][int(((np.sin((45 / 180) * np.pi) * 0.2) / 340) * sr_h)] = 1.0

right_80_filter_ITD = np.zeros((2, 2048), dtype=float)
right_80_filter_ITD[0][int(((np.sin((80 / 180) * np.pi) * 0.2) / 340) * sr_h)] = 1.0
right_80_filter_ITD[1][0] = 1.0

left_160_filter_ITD = np.zeros((2, 2048), dtype=float)
left_160_filter_ITD[0][0] = 1.0
left_160_filter_ITD[1][int(((np.sin((160 / 180) * np.pi) * 0.2) / 340) * sr_h)] = 1.0

left_45_filter_IID = np.zeros((2, 2048), dtype=float)
left_45_filter_IID[0][0] = 1.0
left_45_filter_IID[1][0] = 0.85

right_80_filter_IID = np.zeros((2, 2048), dtype=float)
right_80_filter_IID[0][0] = 0.7
right_80_filter_IID[1][0] = 1.0

left_160_filter_IID = np.zeros((2, 2048), dtype=float)
left_160_filter_IID[0][0] = 1.0
left_160_filter_IID[1][0] = 0.97

left_45_filter_both = np.zeros((2, 2048), dtype=float)
left_45_filter_both[0][0] = 1.0
left_45_filter_both[1][int(((np.sin((45 / 180) * np.pi) * 0.2) / 340) * sr_h)] = 0.85

right_80_filter_both = np.zeros((2, 2048), dtype=float)
right_80_filter_both[0][int(((np.sin((80 / 180) * np.pi) * 0.2) / 340) * sr_h)] = 0.7
right_80_filter_both[1][0] = 1.0

left_160_filter_both = np.zeros((2, 2048), dtype=float)
left_160_filter_both[0][0] = 1.0
left_160_filter_both[1][int(((np.sin((160 / 180) * np.pi) * 0.2) / 340) * sr_h)] = 0.97

helicopter_straight = convolve_stereo(straight_ahead_filter, stereo_data_h, mode='full')
helicopter_left_45_ITD = convolve_stereo(left_45_filter_ITD, stereo_data_h, mode='full')
helicopter_right_80_ITD = convolve_stereo(right_80_filter_ITD, stereo_data_h, mode='full')
helicopter_left_160_ITD = convolve_stereo(left_160_filter_ITD, stereo_data_h, mode='full')
helicopter_left_45_IID = convolve_stereo(left_45_filter_IID, stereo_data_h, mode='full')
helicopter_right_80_IID = convolve_stereo(right_80_filter_IID, stereo_data_h, mode='full')
helicopter_left_160_IID = convolve_stereo(left_160_filter_IID, stereo_data_h, mode='full')
helicopter_left_45_both = convolve_stereo(left_45_filter_both, stereo_data_h, mode='full')
helicopter_right_80_both = convolve_stereo(right_80_filter_both, stereo_data_h, mode='full')
helicopter_left_160_both = convolve_stereo(left_160_filter_both, stereo_data_h, mode='full')

sound(helicopter_straight, rate=sr_h, label='helicopter straight ahead')
sound(helicopter_left_45_ITD, rate=sr_h, label='helicopter left 45 ITD')
sound(helicopter_right_80_ITD, rate=sr_h, label='helicopter right 80 ITD')
sound(helicopter_left_160_ITD, rate=sr_h, label='helicopter left 160 ITD')
sound(helicopter_left_45_IID, rate=sr_h, label='helicopter left 45 IID')
sound(helicopter_right_80_IID, rate=sr_h, label='helicopter right 80 IID')
sound(helicopter_left_160_IID, rate=sr_h, label='helicopter left 160 IID')
sound(helicopter_left_45_both, rate=sr_h, label='helicopter left 45 both')
sound(helicopter_right_80_both, rate=sr_h, label='helicopter right 80 both')
sound(helicopter_left_160_both, rate=sr_h, label='helicopter left 160 both')

Part 2. Static sources using HRTFs

Download the HRTF archive from [https://drive.google.com/uc?export=download&id=1vFzSo-zlNFI-q2T9yvRRmOeZ-elZ8lW3 ]. In that archive you will also find code for the function load_hrtf, which returns the left and right HRTF filters given a source’s azimuth and elevation as input. These filters will localize sounds much better than the ITD/ILD filters.

Apply the HRTFs on the given sources and create 3D sounds that correspond to the locations given above. For each source, you will need to convolve it with the left and right HRTFs of the desired position, generating one sound per channel. Verify that they sound correct using headphones; are they better than before? What differences do you observe? Make sure that the sample rates of the HRTFs and the sounds you convolve them with match.
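One practical note on that sample-rate caveat: the MIT KEMAR compact set that this archive appears to be drawn from is recorded at 44.1 kHz, so a source at any other rate should be resampled first (verify the actual rate of the archive you downloaded). A minimal sketch, with `match_rate` and `apply_hrtf` as illustrative helpers of my own:

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

def match_rate(x, sr_in, sr_out):
    """Polyphase resampling from sr_in to sr_out."""
    g = gcd(sr_in, sr_out)
    return resample_poly(x, sr_out // g, sr_in // g)

def apply_hrtf(mono, h_left, h_right):
    """Convolve one mono source with a left/right HRTF pair, returning
    a two-channel list as used elsewhere in this notebook."""
    return [np.convolve(mono, h_left), np.convolve(mono, h_right)]
```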

In [26]:
def load_hrtf( ad, ed):
    # Return the HRTFs for a given azimuth and elevation
    #  function h,a,e = load_hrtf( ad, ed)
    #
    # Inputs:
    #   ad  is the azimuth to use in degrees (0 is front)
    #   ed  is the elevation to use in degrees (0 is level with ears)
    #
    # Output:
    #   l,r two 128pt arrays, first is left ear HRTF, second is right ear HRTF



    # Path where the HRTFs are
    p = './compact'

    # Get nearest available elevation
    e = max( -40, min( 90, 10*(ed//10)))

    # Get nearest available azimuth
    ad = np.remainder( ad, 360)
    if ad > 180:
        ad = ad-360
    if ad < 0:
        a = abs( ad)
        fl = 1
    else:
        a = ad
        fl = 0
    a = max( 0, min( 180, 5*(a//5)))

    # Load appropriate response
    h = np.fromfile( '%s/elev%d/H%de%.3da.dat' % (p, e, e, a), dtype='>i2').astype( 'double')/32768
    if fl:
        return h[1::2],h[::2]
    else:
        return h[::2],h[1::2]
In [37]:
hrtf_left30_ele30 = load_hrtf(-30, 30)
hrtf_right90_elem40 = load_hrtf(-270, -40)

helicopter_left30_ele30 = convolve_stereo(hrtf_left30_ele30, stereo_data_h, mode='full')
helicopter_right90_elem40 = convolve_stereo(hrtf_right90_elem40, stereo_data_h, mode='full')
sound(helicopter_left30_ele30,  rate=sr_h, label='helicopter left 30 elevation 30')
sound(helicopter_right90_elem40,  rate=sr_h, label='helicopter right 90 elevation -40')

fly_left30_ele30 = convolve_stereo(hrtf_left30_ele30, stereo_data_f, mode='full')
fly_right90_elem40 = convolve_stereo(hrtf_right90_elem40, stereo_data_f, mode='full')
sound(fly_left30_ele30,  rate=sr_f, label='fly left 30 elevation 30')
sound(fly_right90_elem40,  rate=sr_f, label='fly right 90 elevation -40')

crumble_left30_ele30 = convolve_stereo(hrtf_left30_ele30, stereo_data_c, mode='full')
crumble_right90_elem40 = convolve_stereo(hrtf_right90_elem40, stereo_data_c, mode='full')
sound(crumble_left30_ele30,  rate=sr_c, label='crumble left 30 elevation 30')
sound(crumble_right90_elem40,  rate=sr_c, label='crumble right 90 elevation -40')

The direction of the sounds convolved with HRTFs is much easier to identify. The ITD/ILD-filtered versions, by contrast, convey neither elevation nor whether the sound comes from the front or the back.

Part 3. Dynamic Sources with HRTFs

In this part you will need to make a moving sound source. In order to do so we will make use of a fast convolution routine based on your STFT code from lab 1 (ha, you thought you were done with that!). In order to perform fast convolution we can perform an STFT of the sound to use, multiply each time frame of this transform with the DFT of the filter that we want to impose and then use overlap add to transform back to the time domain.

Start by taking each sound from above, and apply your STFT on it. Make sure that the size of the transform is the same as the HRTF’s filter length. The hop size should be the same as the DFT size and you will need to zero pad by as much as the DFT size in order to facilitate the tail of the convolution. Do not use an analysis/synthesis window.

Once you compute this STFT, go through its every time frame and element-wise multiply it with the desired HRTF filter to generate the STFT of the left and right sounds. Figure out which HRTF angle to multiply each frame with so that by the end of the sound you will have made it go around your head.

Once you perform these operations you will have generated two STFT matrices, one for the left channel and one for the right. Use your inverse STFT routine and play the stereo sound through your headphones. You should hear a convincing rendering of the original sounds circling around your head.
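Frame-wise multiplication in the STFT domain is just overlap-add fast convolution. Below is a minimal mono sanity check of that identity, using hop equal to the frame size, frames zero-padded to twice their length, and no window, as the instructions prescribe; `fast_convolve` is my own sketch, not the lab's STFT code.

```python
import numpy as np

def fast_convolve(x, h, frame=128):
    """Overlap-add fast convolution: hop == frame size, each frame
    zero-padded to 2*frame before the FFT (valid while len(h) <= frame + 1,
    so the circular convolution never wraps)."""
    nfft = 2 * frame                       # room for the convolution tail
    H = np.fft.rfft(h, n=nfft)             # filter spectrum, computed once
    n_frames = int(np.ceil(len(x) / frame))
    y = np.zeros(n_frames * frame + nfft)
    for i in range(n_frames):
        seg = x[i * frame:(i + 1) * frame]
        Y = np.fft.rfft(seg, n=nfft) * H   # frame-wise spectral multiply
        y[i * frame:i * frame + nfft] += np.fft.irfft(Y, n=nfft)
    return y[:len(x) + len(h) - 1]

rng = np.random.default_rng(0)
x, h = rng.standard_normal(1000), rng.standard_normal(128)
print(np.allclose(fast_convolve(x, h), np.convolve(x, h)))  # → True
```

The moving source is the same idea with a different `H` per frame (one HRTF angle per frame); the only wrinkle is that the filter then changes between overlapping tails.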

In [86]:
def stft(input_sound, dft_size, hop_size, zero_pad, window):
    result = []
    for i in range(0, len(input_sound), hop_size):
        frame = input_sound[i:i + dft_size]
        if len(frame) < dft_size:  # zero-pad the final partial frame
            frame = np.concatenate([frame, np.zeros(dft_size - len(frame))])
        padded = np.concatenate([frame * window, np.array(zero_pad)])
        result.append(np.fft.rfft(padded) * (hop_size * 2 / dft_size))
    return result

def istft(stft_output, dft_size, hop_size, zero_pad, window):
    # Overlap-add the inverse transform of each frame
    result = np.array([0] * (hop_size * len(stft_output) - hop_size + dft_size), dtype=float)
    for i in range(len(stft_output)):
        each_wave = np.fft.irfft(stft_output[i], n=dft_size + len(zero_pad))[:dft_size]
        result[(i * hop_size):(i * hop_size + dft_size)] = np.add(result[(i * hop_size):(i * hop_size + dft_size)], np.multiply(np.multiply(each_wave, window), (0.2)))
        
    # Return reconstructed waveform
    return result[:hop_size * len(stft_output)]

def show_graph(sub_fig, freq_data, amp_data_len, sample_rate, mode="spectrogram", title=""):
    if mode == "spectrogram":
        transposed_data = np.log(np.absolute(freq_data).real).T
        t1 = np.linspace(0, amp_data_len / sample_rate, len(transposed_data[0]))
        t2 = np.linspace(0, sample_rate / 2, len(transposed_data))
        x1, x2 = np.meshgrid(t1, t2)
        sub_fig.pcolormesh(x1, x2, np.array(transposed_data))
        sub_fig.set_title(title)
        
def make_it_around(original_data, dft_size, hop_size, zero_pad, window):
    spectrogram = stft(original_data, dft_size, hop_size, zero_pad, window)
    applied_spectrogram_l = []
    applied_spectrogram_r = []
    
    for i in range(0, len(spectrogram)):
        hrtf = load_hrtf(i * (360 / len(spectrogram)), 0)
        # Zero-pad the 128-pt HRTFs to dft_size + len(zero_pad) = 1024 points
        left_hrtf_spec = np.fft.rfft(np.concatenate([hrtf[0], np.zeros(896, dtype=float)]))
        right_hrtf_spec = np.fft.rfft(np.concatenate([hrtf[1], np.zeros(896, dtype=float)]))
        applied_spectrogram_l.append(np.multiply(left_hrtf_spec, spectrogram[i]))
        applied_spectrogram_r.append(np.multiply(right_hrtf_spec, spectrogram[i]))
    result_h = [istft(applied_spectrogram_l, dft_size, hop_size, zero_pad, window), 
                istft(applied_spectrogram_r, dft_size, hop_size, zero_pad, window)]
    return result_h

        
dft_size = 512
hop_size = 256 
zero_pad = [0] * 512
window = signal.windows.triang(dft_size) # triangular window (not "no window"): tapers the 50%-overlap frames

result_h = make_it_around(data_h, dft_size, hop_size, zero_pad, window)
result_f = make_it_around(data_f, dft_size, hop_size, zero_pad, window)
result_c = make_it_around(data_c, dft_size, hop_size, zero_pad, window)


sound(data_h,  rate=sr_h, label='helicopter origin')
sound(result_h,  rate=sr_h, label='helicopter around')
sound(data_f,  rate=sr_f, label='fly origin')
sound(result_f,  rate=sr_f, label='fly around')
sound(data_c,  rate=sr_c, label='crumble origin')
sound(result_c,  rate=sr_c, label='crumble around')

Part 4. Extra credit (required for 4-hour credit registrants)

Use the file in [https://drive.google.com/uc?export=download&id=1bxQkXcGa57S3G7IRWu9urS-jCPdf5Pif ] to make a short story. This is a 7-channel file of a scene from a really bad B-movie; each channel contains a different sound. If you just play the file you probably won’t hear most of the content, since you won’t have a 7-speaker setup; import it into a multi-channel editor such as Audacity to get a sense of what’s in there. Since it sounds so boring, add some reverb and 3D-locate the sounds to make the scene more exciting, using what we have done so far. Keep in mind that you can add a different amount of reverb to each sound, and you can place the sounds in 3D dynamically as well.
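For the reverb, a single Schroeder-style feedback comb filter is a simple starting point. The delay, feedback, and wet/dry values below are illustrative, not tuned for this scene; a denser tail comes from cascading several combs with mutually prime delays and following them with allpass filters.

```python
import numpy as np

def comb_reverb(x, delay, feedback=0.6, wet=0.3):
    """Feedback comb: y[n] = x[n] + feedback * y[n - delay],
    then mixed wet/dry. delay is in samples."""
    y = x.astype(float).copy()
    for n in range(delay, len(y)):
        y[n] += feedback * y[n - delay]
    return (1 - wet) * x + wet * y
```

Using slightly different delays for the left and right channels decorrelates the reverb between the ears, which helps the 3D placement read as a room rather than as a filter.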

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()